Crime measurement and prediction is an important and complex topic in city management. Police departments want to explore the distribution pattern of crime using past crime data and then predict potential crime risk in the future. The predictions can be used to allocate personnel more "efficiently," directing more, and probably more frequent, police resources to high-risk areas to combat crime. According to Logan (2008), one of the crime-collection methods in the USA is incident-based reporting. However, arguments concerning the accuracy and equity of the reported crime data have emerged in recent years for the following reasons:
Therefore, if crime prediction is based on this type of inaccurate data, it may result in inefficient police allocation and potentially biased policing.
According to Chicago's crime incident data, in 2017 the city had about 3,600 reported weapons violations concerning unlawful possession of a handgun. The distribution of the reported data shows that this type of crime is highly concentrated in several areas of the city. Without further investigation of the data, people might conclude that handgun violations are most likely to occur in those hot spots in the future, and thus that more regulation or patrols there are reasonable. However, is that true? Can we fully trust this data and the corresponding forecasting?
This project uses Chicago, a city that has long suffered from notorious crime, to explore the spatial distribution of crime and to propose a new crime-prediction method using spatial characteristic features that are more likely to portray where crime occurs, without relying on previously reported crimes (the current mainstream method). The model's ability to generalize across the city will also be evaluated at the end, focusing especially on its crime predictions in neighborhoods of different racial composition.
setwd("~/Desktop/Upenn-Y1/2019Fall/590/HW7_Geo_risk_prediction")
library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(raster)
library(wesanderson)
#library(trendsceek)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform("ESRI:102271") %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform("ESRI:102271") %>%
dplyr::select(District = beat_num)
chicagoBoundary <-
st_read("/Users/jingzhang/Desktop/Upenn-Y1/2019Fall/590/HW7_Geo_risk_prediction/riskPrediction_data/chicagoBoundary.shp")%>%
st_transform("ESRI:102271")
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
ggplot() +
geom_sf(data = bothPoliceUnits) +
facet_wrap(~Legend) +
labs(title = "Police administrative areas, Chicago") +
mapTheme()
#create fishnet
fishnet <-
st_make_grid(chicagoBoundary, cellsize = 500) %>%
st_sf()
# select only Chicago's net, add unique ID
fishnet <-
fishnet[chicagoBoundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
weapons<-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "WEAPONS VIOLATION" &
Description == "UNLAWFUL POSS OF HANDGUN")%>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x, into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform("ESRI:102271") %>%
distinct()
crime_net <-
weapons %>%
dplyr::select() %>%
mutate(countWeapons = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countWeapons = ifelse(is.na(countWeapons), 0, countWeapons),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
This map visualizes the weapons violation (unlawful possession of a handgun) cases in Chicago in 2017. Notably, the cases are not evenly distributed across the city, and we can observe some clustering of handgun violations. For example, the map suggests that the north-central west part, as well as the south-central east part, has highly concentrated handgun violations, while they seem completely absent in the far northwest and southeast of the city. As some points overlap, some clues might be missed in this simple map, so let's visualize it in a more accurate way.
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = weapons, colour="palevioletred1", size=0.02, show.legend = "point") +
labs(title= "Handgun Violation in Chicago, 2017")+
mapTheme()
Apart from plotting the incidents directly, we can use a fishnet base map with 500m × 500m grid cells as the basic unit and count the number of cases within each cell. The following map presents a powerful and fairly straightforward infographic for analyzing and understanding the volume of crime data and its spatial distribution. What we observe in the fishnet map is generally consistent with the former map: there are two obvious handgun violation hot spots in Chicago. The fishnet map further suggests that the crime in the northwest hot spot is more concentrated, while the south-central one is more diffuse and extends east to the coast.
ggplot() +
geom_sf(data = crime_net, aes(fill = countWeapons), lwd=0.1) +
scale_fill_viridis(option="cividis") +
labs(title = "Count of Handgun Violation by Fishnet in Chicago, 2017")+
mapTheme()
Two types of factors were selected in this project: risk factors and protective factors. The basic hypothesis for handgun violations is that this type of crime tends to happen in blighted areas with many abandoned houses, abandoned cars, or empty lots, because those objects shield crime from detection and thereby make an area "suitable" for it. Thus, we select several service records from the 311 Service Requests on the Chicago Data Portal to capture the built-environment features of "blighted areas." Alcohol is another important trigger for crime, so we select the density of liquor retail as a risk factor. Apart from that, many American cities, including Chicago, have long suffered "center decline," meaning that although there are many commercial or shopping places in the city center, it is also a crime-prone area at night. Thus, distance to "the Loop" city center was also selected as a risk factor. We also use protective factors such as police stations, schools, and Broadband Innovation Zones, which are likely to provide more "eyes on the street" and prevent potential crime. The potential features are loaded below:
AlleyLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Alley-Lights-Out-Historical/t28b-ys7j/data") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Alley_Lights_Out")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
PoliceStation <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv/data")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "PoliceStation")
head(PoliceStation)
treeDebris <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Tree-Debris-Historical/mab8-y9h3/data")%>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "treeDebris")
RodentBaiting <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-No-Duplicates/uqhs-j723")%>%
mutate(year = substr(Completion.Date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = Latitude, X = Longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Rodent_Baiting")
bussinessNoL <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q")%>%
filter(CITY == "CHICAGO")%>%
filter(!str_detect(BUSINESS.ACTIVITY, "Liquor"))%>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "bussiness_without_Liquor")
school <- st_read("https://data.cityofchicago.org/api/geospatial/9zky-nrsy?method=export&format=GeoJSON")%>%
dplyr::select(Y = lat, X = long) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "school")%>%
dplyr::select( geometry,Legend)
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
# The Loop, Chicago’s downtown
neighborhoods_Loop <-
neighborhoods %>%
filter(name == "Loop")%>%
mutate(Legend = "City_center_Loop")%>%
dplyr::select(Legend)
Broadband_Innovation_Zones <-
st_read("https://data.cityofchicago.org/api/geospatial/jkfs-hwcr?method=export&format=GeoJSON") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Broadband_Innovation_Zones")
BIZPoint <-
Broadband_Innovation_Zones %>%
st_centroid()
Broadband_Innovation_Zones <- Broadband_Innovation_Zones%>%
dplyr::select(Legend)
vars_net <-
rbind(AlleyLightsOut, abandonBuildings, sanitation, graffiti, abandonCars, liquorRetail, treeDebris, RodentBaiting, bussinessNoL, PoliceStation, school,Broadband_Innovation_Zones, neighborhoods_Loop) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
ggplot() +
geom_sf(data=chicagoBoundary) +
geom_sf(data = rbind(AlleyLightsOut, abandonBuildings, sanitation, graffiti, abandonCars,treeDebris,RodentBaiting,bussinessNoL,
liquorRetail, school, PoliceStation, neighborhoods_Loop,Broadband_Innovation_Zones),
size = .1) +
facet_wrap(~Legend, ncol = 4) +
labs(title = "Risk Factors") +
mapTheme()
This project provides two methods of feature engineering to extract the most valuable information from the data. The first counts the number of items in each fishnet cell. The second measures the distance from the center of each grid cell to the nearest items (average nearest-neighbor distance), which aims to avoid the downside of using a pre-defined geography (e.g., the fishnet grid) as the unit for summarizing the data.
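The average nearest-neighbor distance idea can be sketched in base R without the sf/FNN machinery. The coordinates here are purely hypothetical; the `nn_function` defined below does the same computation efficiently with `FNN::get.knnx`:

```r
# Hypothetical grid-cell centroids and risk-factor incident points
centroids <- matrix(c(0, 0,
                      10, 10), ncol = 2, byrow = TRUE)
incidents <- matrix(c(1, 0,
                      0, 2,
                      12, 10), ncol = 2, byrow = TRUE)
k <- 2
avg_nn <- apply(centroids, 1, function(p) {
  d <- sqrt(colSums((t(incidents) - p)^2))  # distance from centroid p to every incident
  mean(sort(d)[1:k])                        # mean of the k nearest distances
})
avg_nn  # one smoothed "exposure" value per grid cell
```

Averaging over k = 2 or 3 neighbors smooths the feature so that a single nearby incident does not dominate the exposure measure for a cell.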
vars_net.count.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.count.long$Variable)
vars.count <- vars[c(1:3,5,7:13)]
mapList <- list()
for(i in vars.count){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.count.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(option="cividis",name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Count of Potential Features by Fishnet"))
# Average distance from each origin point to its k nearest destination points
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  # k nearest-neighbor distances for every origin point
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, -thisPoint) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%  # mean of the k distances
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
# measure the distance from the centroid of each grid cell to the nearest items
vars_net$Abandoned_Buildings.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)
vars_net$Abandoned_Cars.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)
vars_net$Graffiti.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)
vars_net$Liquor_Retail.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)
vars_net$Alley_Lights_Out.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(AlleyLightsOut), 3)
vars_net$Sanitation.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)
vars_net$TreeDebris.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(treeDebris), 3)
vars_net$RodentBaiting.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(RodentBaiting), 3)
vars_net$BussinessNoL.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(bussinessNoL), 3)
vars_net$PoliceStation.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(PoliceStation), 3)
vars_net$school.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(school), 3)
vars_net$BIZ.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(BIZPoint), 3)
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(option="cividis") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Nearest Neighbor Factors by Fishnet"))
As mentioned, "city center decline" might also contribute to crime, so we measure the Euclidean distance from the centroid of The Loop (downtown) to each grid cell's center point. The Loop in this project is defined by the neighborhood data.
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
# The Loop, Chicago’s downtown
loopPoint <-
neighborhoods %>%
filter(name == "Loop") %>%
st_centroid()
# measure distance to downtown
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
ggplot() +
geom_sf(data=vars_net, aes(fill=loopDistance)) +
scale_fill_viridis(option="cividis") +
labs(title="Euclidean distance to The Loop") +
mapTheme()
#final_net = vars_net + crime_net
final_net <-
left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID")
#add neighbourhood name and policeDistrict into final_net
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(neighborhoods, name)) %>%
st_join(., dplyr::select(policeDistricts, District)) %>%
st_join(., dplyr::select(policeBeats, District)) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
final_net.long.nn <-
  final_net %>%
  dplyr::select(Abandoned_Buildings.nn, Abandoned_Cars.nn, Graffiti, Liquor_Retail,
                Alley_Lights_Out, Sanitation, loopDistance, treeDebris,
                RodentBaiting.nn, BIZ.nn, BussinessNoL.nn, school.nn, PoliceStation) %>%
  gather(AllFeature, value, -geometry)
final_net_nn <- unique(final_net.long.nn$AllFeature)
mapList <- list()
for(i in final_net_nn){
  mapList[[i]] <-
    ggplot() +
    geom_sf(data = filter(final_net.long.nn, AllFeature == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(option="cividis") +
    labs(title=i) +
    mapTheme()}
do.call(grid.arrange, c(mapList, ncol = 4, top = "Selected Features by Fishnet"))
To take the spatial distribution of handgun violations into consideration, we use Local Moran's I to measure whether a given grid cell's handgun violation count is related to that of its surrounding cells. A higher Ii implies that the cell sits in a cluster of high handgun violation counts, i.e., a handgun violation hotspot.
According to the Local Moran's I and p-value maps, we can derive a statistically significant handgun violation hotspot map. The figure also shows the distance from each fishnet cell to the nearest significant hotspot, which is the most important spatial-structure feature to be added to our model.
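The statistic can be illustrated with a toy base-R computation (the values and the hand-built binary adjacency matrix are hypothetical; `spdep::localmoran`, used below, also returns p-values):

```r
# Four hypothetical cells in a row: two high-count cells next to two low-count cells
x <- c(10, 9, 1, 0)
W <- matrix(c(0,1,0,0,
              1,0,1,0,
              0,1,0,1,
              0,0,1,0), nrow = 4, byrow = TRUE)  # binary rook adjacency
z  <- x - mean(x)             # deviations from the mean
m2 <- sum(z^2) / length(x)    # second moment
Ii <- (z / m2) * (W %*% z)    # Local Moran's I for each cell
Ii  # large positive Ii flags cells inside a high (or low) cluster
```

Here all four cells get positive Ii because each one sits next to similarly high (or similarly low) neighbors; a cell on the boundary between a high and a low cluster would get a value near zero or negative.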
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
set.ZeroPolicyOption(TRUE)
## [1] FALSE
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countWeapons, final_net.weights)),
as.data.frame(final_net, NULL)) %>%
st_sf() %>%
dplyr::select(Handgun_Violation_Count = countWeapons,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
#loop for map
vars_localM <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars_localM){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(option="cividis") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(varList, ncol = 2, top = "Local Moran's I Statistics of Handgun Violation"))
final_net <-
final_net %>%
mutate(handgun.isSig = ifelse(localmoran(final_net$countWeapons,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(handgun.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, handgun.isSig == 1))), 1 ))
ggplot() +
geom_sf(data = final_net, aes(fill = handgun.isSig.dist),lwd=0.1) +
scale_fill_viridis(option="cividis") +
labs(title = "Distance to Highly Significant Handgun Violation Hotspots") +
mapTheme()
The following correlation plots show the relationship between the selected features (risk & protective) and the number of handgun violations. Overall, most of the features we selected have a strong correlation with the number of handgun violations, while it is obvious that the "count in each grid cell" and "nearest-neighbor distance" versions of the same factor are collinear, so for the final selection, the one with the weaker correlation will be removed from the model.
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(c(1,4:29), -c("PoliceStation","City_center_Loop","Broadband_Innovation_Zones"))%>%
#dplyr::select(countWeapons, Abandoned_Buildings.nn, Abandoned_Cars.nn, Graffiti, Liquor_Retail, Alley_Lights_Out.nn, Sanitation.nn, loopDistance, TreeDebris.nn, RodentBaiting.nn, BIZ.nn, BussinessNoL.nn, school.nn, PoliceStation) %>%
gather(Variable, Value, -countWeapons )
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countWeapons, use = "complete.obs"))
##### need count
ggplot(correlation.long, aes(Value, countWeapons)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "palevioletred1", lwd=1) +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Handgun Violation count as a function of risk or protective factors")
ggplot(final_net, aes(countWeapons)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution of Handgun Violation by grid cell")
We use four types of cross-validation to check the errors:

1. Random k-fold, just risk factors
2. Random k-fold, risk factors + spatial structure (distance to the handgun violation hotspots)
3. LOGO-CV (leave-one-group-out cross-validation), just risk factors
4. LOGO-CV, risk factors + spatial structure (distance to the handgun violation hotspots)
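The LOGO-CV pattern can be sketched with base R and a Poisson GLM on hypothetical data; the `crossValidate()` function below implements the same loop over the fishnet, where a "group" is either a random fold ID or a neighborhood name:

```r
# Hypothetical data: three spatial groups, one predictor, a Poisson count outcome
set.seed(1)
d <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  x1    = runif(60),
  y     = rpois(60, lambda = 2)
)
errors <- sapply(unique(d$group), function(g) {
  train <- d[d$group != g, ]                 # hold out one whole group
  test  <- d[d$group == g, ]
  fit   <- glm(y ~ x1, family = "poisson", data = train)
  pred  <- predict(fit, test, type = "response")
  mean(abs(test$y - pred))                   # MAE on the held-out group
})
errors  # one MAE per held-out group
```

Holding out whole groups (rather than random cells) tests whether the model generalizes to places whose spatial context it has never seen.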
reg.vars <- c("Abandoned_Buildings.nn" , "Abandoned_Cars.nn" , "Graffiti" , "Liquor_Retail" , "Alley_Lights_Out.nn" , "Sanitation.nn" , "loopDistance" , "TreeDebris.nn" , "RodentBaiting.nn" , "BIZ.nn", "BussinessNoL.nn" , "school.nn" , "PoliceStation" , "school" , "PoliceStation.nn")
reg.ss.vars <- c("Abandoned_Buildings.nn" , "Abandoned_Cars.nn" , "Graffiti" , "Liquor_Retail" , "Alley_Lights_Out.nn" , "Sanitation.nn" , "loopDistance" , "TreeDebris.nn" , "RodentBaiting.nn" , "BIZ.nn", "BussinessNoL.nn" , "handgun.isSig" , "handgun.isSig.dist" , "school.nn" , "PoliceStation" , "school" , "PoliceStation.nn")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
# cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countWeapons ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countWeapons",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countWeapons, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countWeapons",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countWeapons, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countWeapons",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countWeapons, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countWeapons",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countWeapons, Prediction, geometry)
Comparing the four predictions with the observed handgun violations, we find that the models with spatial-structure variables predict better, while the difference between the random k-fold and LOGO cross-validation methods is less obvious in this study (error comparison).
reg.summary <-
rbind(
mutate(reg.cv, Error = countWeapons - Prediction,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = countWeapons - Prediction,
Regression = "Random k-fold CV: Spatial Structure"),
mutate(reg.spatialCV, Error = countWeapons - Prediction,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = countWeapons - Prediction,
Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
st_sf()
grid.arrange(
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Prediction), lwd=0.1) +
facet_wrap(~Regression) +
scale_fill_viridis(option="cividis") +
labs(title = "Predicted Handgun Violation by Regression",
subtitle = "Figure 1") +
mapTheme() + theme(legend.position="bottom"),
filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
ggplot() +
geom_sf(aes(fill = countWeapons), lwd=0.1) +
scale_fill_viridis(option="cividis") +
labs(title = "Observed Handgun Violation\n",
subtitle = "Figure 1") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
filter(reg.summary) %>%
ggplot() +
geom_sf(aes(fill = Error), lwd=0.1) +
facet_wrap(~Regression) +
scale_fill_viridis(option="cividis") +
labs(title = "Handgun Violation errors by Regression") +
mapTheme()
This table shows that random k-fold CV with spatial-structure variables is the most suitable method for this study, since it has the smallest mean absolute error and standard deviation.
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
summarize(MAE = round(mean(abs(Prediction - countWeapons), na.rm = T),2),
SD_MAE = round(sd(abs(Prediction - countWeapons), na.rm = T),2)) %>%
kable(caption = "Table: MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "palevioletred1") %>%
row_spec(4, color = "black", background = "palevioletred1")
| Regression | MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 1.26 | 1.96 |
| Random k-fold CV: Spatial Structure | 1.19 | 1.87 |
| Spatial LOGO-CV: Just Risk Factors | 1.29 | 2.02 |
| Spatial LOGO-CV: Spatial Structure | 1.24 | 1.99 |
We use race to check the generalizability of this model. It shows that the model does not generalize equally well across racial contexts, although the difference drops significantly when we add the spatial-structure variables. All models overpredict in majority non-white communities, meaning that the count of handgun violations we predict in majority non-white neighborhoods is larger than the observation, and all models underpredict in majority white communities. However, this difference may be unavoidable, since the bias is embedded in the crime dataset itself.
#census_api_key("3929725aba6e199105d8e902356ad2fb95af811d",install=TRUE,overwrite=TRUE)
#readRenviron("~/.Renviron")
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2017, state=17, county=031, geometry=T) %>%
st_transform("ESRI:102271") %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
final_reg <-
filter(reg.summary) %>%
mutate(uniqueID = rownames(.))
final_reg.tracts <-
st_join(st_centroid(final_reg), tracts17) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_reg, uniqueID)) %>%
st_sf() %>%
na.omit()
st_set_geometry(final_reg.tracts, NULL) %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
Comparing the traditional hotspot map (kernel density) to our prediction hotspot map (random k-fold CV: spatial structure), we find that the high-risk areas in our model are much more detailed than in the traditional one, which can help the police allocate resources more easily.
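The percentile risk categories used in this comparison can be sketched in base R (the predictions here are hypothetical; `dplyr::ntile` in the chunks below does the percentile assignment):

```r
# Hypothetical predicted counts for eight grid cells
preds <- c(0.2, 1.5, 3.0, 0.1, 7.2, 2.4, 0.9, 5.1)
# Approximate ntile(preds, 100): percentile rank of each cell
pct <- ceiling(rank(preds, ties.method = "first") * 100 / length(preds))
# Bin percentiles into the five risk categories used in the maps
risk_cat <- cut(pct, breaks = c(0, 29, 49, 69, 89, 100),
                labels = c("1% to 29%", "30% to 49%", "50% to 69%",
                           "70% to 89%", "90% to 100%"))
table(risk_cat)  # count of cells per risk category
```

Binning both the kernel density surface and the model predictions into the same percentile categories makes the two maps directly comparable even though their raw units differ.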
# Compute kernel density
weapons_ppp = as.ppp(st_coordinates(weapons), W = st_bbox(final_net))
weapons_KD = density.ppp(weapons_ppp, 1000)
# Convert kernel density to grid cells taking the mean
weapon_KDE_sf <- as.data.frame(weapons_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fisnnet.
cbind(
aggregate(
dplyr::select(weapons) %>% mutate(weaponCount = 1), ., length) %>%
mutate(weaponCount = replace_na(weaponCount, 0))) %>%
#Select the fields we need
dplyr::select(label, Risk_Category, weaponCount)
weapon_risk_sf <-
filter(final_reg, Regression == "Random k-fold CV: Spatial Structure") %>%
st_as_sf(crs = st_crs(weapon_KDE_sf)) %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(weapons) %>% mutate(weaponCount = 1), ., length) %>%
mutate(weaponCount = replace_na(weaponCount, 0))) %>%
dplyr::select(label,Risk_Category, weaponCount)
rbind(weapon_KDE_sf, weapon_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(weapons, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(option="cividis", discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
This part calculates both models' rate of predicted crime to observed crime by risk category. Although in the 1% to 89% risk categories the traditional model sometimes performs better than ours, our model captures more of the highest-risk crime (90% to 100%).
rbind(weapon_KDE_sf, weapon_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countWeapons = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countWeapons / sum(countWeapons)) %>%
ggplot(aes(Risk_Category, Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE)+
labs(title = "Risk prediction vs. kernel density, 2017 Handgun Violations")
Generally, this algorithm uses handgun violation crime data together with additional objective risk and protective factors to predict future potential crime. I would recommend this algorithm, since it predicts better than the traditional method, tries to mitigate the impact of data bias, and attempts to be accurate and generalizable across communities of different racial composition. For further improvement, more related independent variables should be included in the model to produce better predictions.